Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make search more consistent #24121

Closed
wants to merge 3 commits into from
Closed

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Oct 13, 2017

This is an implementation of #24103. If will fail the tests now. I will fix the tests when there is an agreement that the proposed functionality is OK.
All key assumptions how search now works are given in its docstring

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are too many unrelated formatting changes in this PR. Also I won't repeat comments I made at #24116. I'll review it again when it's more focused.

or UTF-8 strings). The third argument optionally specifies a starting index. The return
value is a range of indexes where the matching sequence is found, such that `s[search(s,x)] == x`:
or UTF-8 strings).
Alternatively the first and the second arguments may be arrays of `Int8` or `UInt8`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not document this, it really should be an internal function which shouldn't have been exposed. I was preparing a PR to merge search with findfirst which would also handle this.

or UTF-8 strings).
Alternatively the first and the second arguments may be arrays of `Int8` or `UInt8`.
The third argument optionally specifies a starting index.
If it is not a valid index into the first argument then error is thrown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes without saying, I don't think it's worth making the description longer than it already is to mention this.

Alternatively the first and the second arguments may be arrays of `Int8` or `UInt8`.
The third argument optionally specifies a starting index.
If it is not a valid index into the first argument then error is thrown.
The return value when the second argument is a string or a vector of bytes is the samllest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"samllest". "exits".


If the first argument is empty and third argument is not given `0:-1` is returned.

The return value when the second argument is a signle character, a vector or a set of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"signle"

`search(string, 'c')` = `index` such that `string[index] == 'c'`, or `0` if unmatched.
In particular a search for a zero length second argument returns `i:(i-1)`.

If the first argument is empty and third argument is not given `0:-1` is returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better give the general rule and use start=start(s) in the signature. People will figure out what happens in special cases.

in(c::Char, s::AbstractString) = (search(s,c)!=0)
search(s::AbstractString, c::Chars) = s == "" ? 0 : search(s, c, start(s))

in(c::Char, s::AbstractString) = search(s,c) != 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plase don't change unrelated lines. Same below. This makes it really painful to review.

function _search_bloom_mask(c)
UInt64(1) << (c & 63)
end

_nthbyte(s::String, i) = codeunit(s, i)
_nthbyte(a::ByteArray, i) = a[i]
_nthbyte(s::String, i) = @inbounds codeunit(s, i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better add @inbounds at the call site, and add @propagate_inbounds annotations here.


Similar to [`search`](@ref), but return only the start index at which
the substring is found, or `0` if it is not.
Similar to [`search`](@ref), but return only the start index at which the second argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"substring" was clearer, let's not make the docs worse because of that UInt8 thing.

@bkamins
Copy link
Member Author

bkamins commented Oct 15, 2017

@nalimilan Sorry for messing with the code formatting - lesson taken and I will fix it.

Before I correct this PR I would like a clear recommendation about the issue also discussed in #24116 regarding start argument of search.

Currently:

julia> search("", "asd")
0:-1

and I think this is what users would expect.
On the other hand:

julia> search("", "asd", 10)
0:-1

is something we want to avoid as 10 is an invalid index.

The only problem is what we want to return in search("", "asd") vs. search("", "asd", 1).
My reasoning is that the second form should throw an error. And I assume we have a consensus here as 1 is an invalid index for "". However, I think that search("", "asd") should still return 0:-1. Otherwise users will have to write their code as something like:

if endof(s) >= start(s)
    x = search(s, "some string")
    # do something with x
else
    # do the same thing knowing that no match was found
end

which is painful and error prone as in general user might not know if s is empty.

Remember that my proposal additionally fixes:

julia> search("","")
1:0

which is incorrect for sure. But I would say that 0:-1 should be returned (and not an error thrown in this case).

@nalimilan What do you think is a proper way to go given the above?

@nalimilan
Copy link
Member

I hadn't realized it was such a Catch 22. I'd say we should have a special case for 1 with "", since start("") == 1, and start(s) is the default value for the index.

The only other solution I can think of is to allow search("", "asd") but not search("", "asd", 1), but that would be cumbersome to implement and problematic for users in they want to write generic code with an index falling back to start(s). So making an exception sounds better.

@bkamins
Copy link
Member Author

bkamins commented Oct 16, 2017

Just to make sure I understand your sentence:

So making an exception sounds better.
You are opting to make search on "" always throw an exception (when searching for substrings, regexps and Chars)?

If this is the decision I can implement it and it should go into the documentation then.

@nalimilan
Copy link
Member

That's my preferred solution at this stage, but you may want to way for possible comments at #24103 (comment).

@kshyatt kshyatt added the strings "Strings!" label Oct 16, 2017
@bkamins
Copy link
Member Author

bkamins commented Feb 11, 2018

closing as it is outdated.

@bkamins bkamins closed this Feb 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants